How Well Can We Predict the Popularity of a Song?

Nate Lindley, Aidan Hijazi-Klop, Alex Williams, Lyle Johnson

12/10/2022

Background

In surveying candidate data sets for this project, our group settled on an extensive database of Spotify songs (tracks), with 232,725 entries across 18 features. The features include a number of acoustic characteristics (loudness, tempo, valence, time signature, etc.), a popularity metric (from 0 to 100), and identifying fields such as genre, artist name, and track name. All of the members of our team are deeply interested in music and saw this data set, given its expansiveness, as a good foundation to build our work upon.

The dataset can be found here: Kaggle - Spotify Tracks DB
With additional credit for work to grab the data: Github - Spotify Data Project

Can we use the quantifiable metrics or characteristics that define a song to predict its popularity once it is released under the label of a specific genre?

Our group settled on this question because popularity was the metric we saw as most likely to be determined by the other features present in the selected dataset. Stepping back from the specific data at hand, popularity is also a metric pertinent both to businesses - be they independent artists or recording labels - and to us as consumers, since popularity undeniably plays a role in what we consume; this relevance to both artist and consumer makes it a worthy subject for our research question.

Music Characteristics

Per the Spotify Developer API (linked here), a single track has 13 quantifiable features that contribute to its audio profile. These are detailed in the table below with descriptions and possible value ranges.

Table 1. Audio features from API
#   Characteristic    Description                                              Value Range
1   acousticness      A confidence measure of whether the track is acoustic.   0.0 - 1.0
2   danceability      How suitable the track is for dancing.                   0.0 - 1.0
3   duration_ms       The duration of the track in milliseconds.               > 0
4   energy            A perceptual measure of intensity and activity.          0.0 - 1.0
5   instrumentalness  Predicts whether the track contains no vocals.           0.0 - 1.0
6   key               The key the track is in (pitch class notation).          -1 - 11
7   liveness          Detects the presence of an audience in the recording.    0.0 - 1.0
8   loudness          Overall loudness of the track in decibels (dB).          ~ -60 - 0
9   mode              Modality of the track: major (1) or minor (0).           0 or 1
10  speechiness       Detects the presence of spoken words.                    0.0 - 1.0
11  tempo             Estimated tempo in beats per minute (BPM).               > 0
12  time_signature    Estimated number of beats per bar.                       3 - 7
13  valence           Musical positiveness conveyed by the track.              0.0 - 1.0

There are 5 additional features returned by a GET call to the Spotify Dev API that are useful for labeling and managing data throughout the analytics process but have no predictive value. These are detailed in the second table below. Not all of these miscellaneous features are present in the Kaggle dataset used, but they are included here for complete context.

Table 2. Misc. features from API
#   Characteristic  Description                                                           Value Range
14  analysis_url    A URL to access the full audio analysis of this track. An access     NA
                    token is required to access this data.
15  id              The Spotify ID for the track.                                         NA
16  track_href      A link to the Web API endpoint providing full details of the track.  NA
17  type            The object type. Allowed value: "audio_features"                      NA
18  uri             The Spotify URI for the track.                                        NA

Previous Work

From our research, we found one pre-existing project targeting the same question as ours. The work in question was conducted by Matt Devor; operating on the same 13 characteristics we identified from the API, Devor performed a range of data analysis and built toward a Linear Regression model to directly predict popularity. Devor's work is limited by the nature of its concluding model, but it nonetheless serves as an extensive reference for exploratory data analysis, which our team looks to build upon in our work.

Reference: Github - Predicting Spotify Song Popularity

Our Approach

To address the question above - exploring the predictive power of a track's acoustic characteristics in determining its popularity - in a manner distinct from existing work, our team will use a random forest model; the model will perform regression over the continuous target variable, popularity. Through exploratory analytics and further methodological decisions, we seek to create a model with a sub-15% NRMSE for potential use in a real-world business context.
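As a concrete reference for the target metric, here is one common definition of NRMSE - RMSE normalized by the range of the observed target - sketched in base R. The range-based normalization is an assumption; other conventions divide by the mean or standard deviation instead.

```r
# NRMSE: root-mean-square error divided by the range of the actual values.
# Range-based normalization is an assumption about this report's convention.
nrmse <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  rmse / (max(actual) - min(actual))
}

# Toy example: four popularity scores, each prediction off by 10
actual    <- c(20, 40, 60, 80)
predicted <- c(30, 30, 70, 70)
nrmse(actual, predicted)  # 10 / 60 = 0.1667
```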

Preliminary Data Analysis - Count of Songs in Each Genre
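A minimal sketch of how such a count could be produced in base R; the column name `genre` and the toy data frame below are assumptions standing in for the real dataset.

```r
# Toy stand-in for the real final_pop data frame (assumed `genre` column)
final_pop <- data.frame(genre = c("Rap", "Jazz", "Rap", "Country", "Rap", "Jazz"))

# Count tracks per genre, most common first, and plot
genre_counts <- sort(table(final_pop$genre), decreasing = TRUE)
barplot(genre_counts, las = 2, main = "Count of Songs in Each Genre")
```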

Final Data Prep (factorize all character columns and cast all numeric columns to numeric)

# Columns 7, 10, and 13 hold character data; convert them to factors
# so the random forest can split on them
factor_cols <- c(7, 10, 13)
final_pop[, factor_cols] <- lapply(final_pop[, factor_cols], as_factor)  # forcats::as_factor

# Cast columns that were read in as character back to numeric
final_pop$duration_ms <- as.numeric(final_pop$duration_ms)
final_pop$popularity  <- as.numeric(final_pop$popularity)

Preliminary Data Analysis - Correlation Plot Between Features
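A sketch of one way to build such a correlation plot with base R; the synthetic columns below stand in for the real numeric features.

```r
set.seed(1)
# Synthetic stand-ins for a few of the dataset's numeric features
final_pop <- data.frame(loudness = rnorm(50), energy = rnorm(50),
                        tempo = rnorm(50), popularity = rnorm(50))

# Pairwise correlations over the numeric columns only
num_cols <- sapply(final_pop, is.numeric)
corr_mat <- cor(final_pop[, num_cols], use = "pairwise.complete.obs")

# Base-R heatmap; packages such as corrplot give more polished output
heatmap(corr_mat, symm = TRUE, main = "Correlation Between Features")
```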

Preliminary Data Analysis - Loudness Plot

Preliminary Data Analysis - Danceability Plot

Preliminary Data Analysis - Acousticness Plot

Random Forest with All Genres - First Model

## [1] "Initial mtry value: 3.60555127546399"
## [1] "RMSE of our first RF model is: 15.4594981812402"

Look at variable importance

Overall
acousticness 3841.0445
danceability 2250.1343
duration_ms 2993.5943
energy 2658.1052
instrumentalness 1599.8494
key 3672.9501
liveness 2068.6330
loudness 3687.1308
mode 200.6078
speechiness 2003.7502
tempo 1661.8719
time_signature 377.4922
valence 2046.0900
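A sketch of how such an importance table can be obtained from a fitted randomForest model, using toy data; the "Overall" column header above suggests the report may have gone through caret::varImp, which is an assumption.

```r
library(randomForest)
set.seed(7)

# Toy regression forest with one strongly predictive feature (x1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
dat$y <- 3 * dat$x1 + rnorm(200)
rf <- randomForest(y ~ ., data = dat, importance = TRUE)

# Node-purity importance, sorted most important first
imp <- importance(rf, type = 2)   # type = 2: IncNodePurity
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

varImpPlot(rf)                    # quick visual of the importance measures
```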

Optimizing the model

To optimize the model we tuned the mtry hyperparameter. When run on our tuning dataset, an mtry of 2 proved more accurate than the 4 we had originally used, so we created another model with mtry set to 2. This gave better results (both RMSE and NRMSE) when run on our test dataset.
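The search log below is characteristic of randomForest::tuneRF; a sketch of how it could have been invoked follows, on toy data. The stepFactor value is an assumption, while the improve threshold matches the 0.05 printed in the log.

```r
library(randomForest)
set.seed(42)

# Toy stand-in for the tuning data: 13 predictors
n <- 300
tune_x <- as.data.frame(matrix(rnorm(n * 13), n, 13))
tune_y <- 50 + 10 * tune_x$V1 + rnorm(n, sd = 5)

# tuneRF steps mtry up and down from mtryStart, keeping a move while the
# OOB error improves by at least `improve` (0.05 here, matching the log)
tuned <- tuneRF(x = tune_x, y = tune_y,
                mtryStart = 4, stepFactor = 2, improve = 0.05,
                ntreeTry = 500, trace = TRUE, plot = FALSE)

# Refit the final model at the best mtry found
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
pop_RF_2  <- randomForest(x = tune_x, y = tune_y, mtry = best_mtry, ntree = 500)
```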

## mtry = 4  OOB error = 231.1599 
## Searching left ...
## mtry = 2     OOB error = 229.1461 
## 0.008711894 0.05 
## Searching right ...
## mtry = 12    OOB error = 232.2126 
## -0.004553863 0.05

Creating the Tuned Model

## [1] "RMSE of original model: 15.4594981812402"
## [1] "RMSE of optimized model: 14.5223043796896"

Testing the Model

After testing the model we found an RMSE of around 14.1 and an NRMSE of around 0.16.

## [1] "pop_RF_2 rmse: 14.1544440007784"
## [1] "pop_RF_2 test NRMSE: 0.164586558148585"

Exploration With Focused Data

We followed the same steps as above when working with 3 subsets of our original dataset: rap music, jazz music, and country music.
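A sketch of the subsetting step in base R; the column name `genre`, the exact genre labels, and the toy data frame are assumptions about the dataset.

```r
# Toy stand-in for the full dataset
final_pop <- data.frame(
  genre      = c("Rap", "Jazz", "Country", "Rap", "Jazz", "Rap"),
  popularity = c(80, 55, 60, 75, 50, 90)
)

# One data frame per focus genre; each then gets its own train/test split
rap_pop     <- final_pop[final_pop$genre == "Rap", ]
jazz_pop    <- final_pop[final_pop$genre == "Jazz", ]
country_pop <- final_pop[final_pop$genre == "Country", ]
```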

Creating Models for Rap Music

## [1] "Starting mtry for rap dataset: 3.60555127546399 rounds to 4"
## [1] "RMSE for first Rap RF model: 8.07926310206631"
## mtry = 4  OOB error = 67.62196 
## Searching left ...
## mtry = 2     OOB error = 67.34297 
## 0.004125716 0.05 
## Searching right ...
## mtry = 12    OOB error = 68.69809 
## -0.01591392 0.05

Tuned Rap Model

## [1] "rap_RF_2 RMSE: 8.11214128973794"
## [1] "rap_RF_2 NRMSE: 0.124802173688276"
## 
## Call:
##  randomForest(formula = as.numeric(popularity) ~ ., data = rap_train,
##      ntree = 500, mtry = 5, replace = TRUE, sampsize = 400, nodesize = 5,
##      importance = TRUE, proximity = FALSE, norm.votes = TRUE,
##      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 65.80684
##                     % Var explained: 0.91

Speechiness Visualization

Danceability Visualization

Rap Model Testing

## [1] "rap_RF_2 test RMSE: 8.17224959688551"
## [1] "rap_RF_2 test NRMSE: 0.0961441129045354"

Creating Models for Jazz Music

## [1] "RMSE of first Jazz Model: 9.10492653612027"
## mtry = 4  OOB error = 86.96277 
## Searching left ...
## mtry = 2     OOB error = 83.29836 
## 0.04213769 0.05 
## Searching right ...
## mtry = 8     OOB error = 85.91315 
## 0.01206981 0.05

Optimized Jazz Model

## [1] "jazz_RF_2 RMSE: 9.11880324260826"
## [1] "jazz_RF_2 NRMSE: 0.124915112912442"

Jazz Song Popularity vs. Instrumentalness

Jazz Model Testing

## [1] "jazz_RF_r test RMSE: 9.51695550165936"
## [1] "jazz_RF_2 test NRMSE: 0.120467791160245"

Creating Models for Country Music

## [1] "RMSE of first Country Model: 9.50881642786787"
## mtry = 4  OOB error = 94.18628 
## Searching left ...
## mtry = 2     OOB error = 91.8066 
## 0.02526563 0.05 
## Searching right ...
## mtry = 8     OOB error = 95.83211 
## -0.01747426 0.05

Optimized Country Model

## [1] "country_RF_2 RMSE: 9.50399215928269"
## [1] "country_RF_2 NRMSE: 0.120303698218768"

Country Song Popularity vs. Instrumentalness

Country Song Popularity vs. Loudness

Country Model Testing

## [1] "country_RF_r test RMSE: 9.66801352692891"
## [1] "country_RF_2 test NRMSE: 0.117902603986938"

Conclusions/Future Work

Conclusions

Our preliminary goal was to create a model with an NRMSE below 15%. We succeeded in this goal with our genre-focused models (the all-genre model came in just above, at about 16.5%), but we have some ideas for future steps to further increase accuracy.